Abstract: Feature selection in learning to rank has recently
emerged as a crucial issue. Whereas several preprocessing
approaches have been proposed, only a few works have focused
on integrating feature selection into the learning process. In
this work, we propose a general framework for feature selection
in learning to rank using SVM with a sparse regularization term.
We investigate both classical convex regularizations, such as
$\ell_1$ or weighted $\ell_1$, and non-convex regularization terms,
such as the log penalty, the Minimax Concave Penalty (MCP) or the
$\ell_p$ pseudo-norm with p < 1. Two algorithms are proposed: first,
an accelerated proximal approach for solving the convex problems;
second, a reweighted $\ell_1$ scheme to address the non-convex
regularizations. We conduct intensive experiments on nine datasets
from the Letor 3.0 and Letor 4.0 corpora. Numerical results show
that the non-convex regularizations we propose lead to more
sparsity in the resulting models while prediction performance is
preserved. The number of features is decreased by up to a factor
of six compared to the $\ell_1$ regularization. In addition, the
software is publicly available on the web.

Index Terms: Feature selection, learning to rank, regularized
SVM, sparsity, FBS algorithms, non-convex regularizations.


I. Introduction 

Learning to rank is a crucial issue in the field of
information retrieval (IR). The main goal of learning
to rank is to automatically learn ranking functions using a
machine learning algorithm, in order to optimize the ranking
of documents or web pages. Several algorithms have been
proposed during the past decade [1] that can combine a very
large number of features to learn ranking functions.

As the number of features that can be used by algorithms has
increased, the issue of feature selection in learning to rank
has emerged, for two main reasons.

First, as more and more features are incorporated into
algorithms, not only do the models become more difficult to
understand, but they also potentially have to deal with more
and more noisy or irrelevant features. As feature selection
is a well-known way in machine learning to deal with noisy and
irrelevant features, it is seen as a quite natural way to solve
this problem in learning to rank.

Second, the amount of training data used in learning to 
rank is substantial. As a consequence, learning a ranking 
function using algorithms is generally costly and can be time- 
consuming. Reducing the number of features, and thus the 
dimensionality of the problem, is a promising way to handle 
the issue of high computational cost. 

Recent works have focused on the development of feature
selection methods dedicated to learning to rank, which can be
either preprocessing steps, such as filter [2]-[4] and wrapper
approaches [3]-[7], or integrated into the learning algorithm,
such as embedded approaches [8]-[10]. In the latter case,
the learning algorithm is called a sparse algorithm. In this
paper, we consider an embedded approach for feature selection
in learning to rank. We propose a general framework for
feature selection in learning to rank using Support Vector
Machines (SVM) and a regularization term to induce sparsity.
We investigate both convex regularizations such as $\ell_1$ na
and non-convex regularizations such as MCP na, log or
$\ell_p$, p < 1 [13]. To the best of our knowledge, this is the
first work that investigates the use of non-convex penalties
for feature selection in learning to rank. We first propose an
accelerated Forward-Backward Splitting algorithm in order
to solve the $\ell_1$-regularized problem. Then, we propose a
reweighted $\ell_1$ algorithm to handle the non-convex penalties,
which builds on the first algorithm. We conduct intensive
experiments on the Letor 3.0 and 4.0 corpora. Our convex
algorithm leads to performance similar to that of
state-of-the-art methods. We show that the second algorithm,
which uses non-convex regularizations, is a very competitive
feature selection method, since it provides results as good as
those of convex approaches while yielding much sparser models.
Indeed, it provides similar values of evaluation measures while
using half as many features on average.

This paper is organized as follows. Section II presents the
state of the art for learning to rank algorithms, feature
selection methods, sparse SVM and Forward-Backward Splitting
approaches. We formulate the optimization problem in Section
III. Section IV introduces the algorithms used to solve the
optimization problems. We fully describe the datasets used
and the experimental protocol in Section V. In Section VI, we
first analyze the ability of our approach to induce sparsity
into models. We then evaluate the performance of our
framework in terms of MAP and NDCG@10, and compare
these results with those obtained with two recent embedded
feature selection methods.

II. Related works 

Our work focuses on feature selection in learning to rank. 
We begin this section by presenting existing learning to 
rank algorithms. We provide an overview of feature selection 
methods dedicated to learning to rank and introduce feature 
selection using sparse regularized SVM. 

A. Learning to rank algorithms 

The learning to rank process consists of a training phase
and a prediction phase. In IR, the training data are composed
of query-document pairs represented by feature vectors. A
relevance judgement between the query and the document
is given as ground truth. The purpose of the training phase
is to learn a model that provides the optimum ranking of
documents according to their relevance to the query. The
ability of the model to correctly rank documents for new
queries is then evaluated during the prediction phase, on
test data. The following is a short overview of learning to rank
approaches and algorithms. A more complete introduction
to learning to rank for IR can be found in the book by Liu [1].

Three approaches, called pointwise, pairwise and listwise,
have been proposed to solve the learning to rank problem. In
the pointwise approach, each instance is a vector of features
$\mathbf{x}_i$ which represents a query-document pair. The ground
truth can be either a relevance score $s \in \mathbb{R}$ or a class
of relevance (such as "not relevant", "quite relevant", "highly
relevant"). When dealing with a relevance score, learning to rank
is seen as a regression problem. Some algorithms such as Subset
Ranking iflTll have been proposed to solve it. When dealing
with classes of relevance, learning to rank is considered
as a classification problem or as an ordinal regression
problem, depending on whether there is an ordinal relation
between the classes of relevance. Some algorithms based on
SVM ca or on boosting m deal with the classification
problem. Crammer and Singer ini proposed an algorithm
for ordinal regression. In the pairwise approach, also referred
to as preference learning M, each instance is a pair of feature
vectors $(\mathbf{x}_i, \mathbf{x}_j)$ for a given query q. The ground
truth is given as a preference $y \in \{-1, 1\}$ between the two
documents. For a given pair $(\mathbf{x}_i, \mathbf{x}_j)$, if
$\mathbf{x}_i$ is preferred to $\mathbf{x}_j$, we write
$\mathbf{x}_i \succ_q \mathbf{x}_j$ and y is set to 1. Conversely, if
$\mathbf{x}_j$ is preferred to $\mathbf{x}_i$, we write
$\mathbf{x}_j \succ_q \mathbf{x}_i$ and y is set to -1. It
is thus a classification problem. Many algorithms have been
developed to deal with this problem, such as RankNet ifT^
based on neural networks, RankBoost Ho) based on boosting,
or RankSVM-Primal ll^ and RankSVM-Struct [22] based
on SVM. Finally, the listwise approach considers the whole
ranked list of documents as the instance of the algorithm.
Most works have focused on the proposal of new specific
loss functions, based on the optimization of an IR metric or
on permutation counts, in order to solve this kind of problem
nnEa.

These approaches have been shown to be both efficient
and effective at learning functions that ensure high ranking
performance in terms of IR measures. Nevertheless, they may
be suboptimal for real-life use with large-scale data. Ranking
functions deal with a very large number of features, which
raises three critical issues. First, as features may take time to
compute, preprocessing steps such as the creation of training
data may become time-consuming. Second, due to the high
dimensionality of training data, algorithms may not be scalable
or may take too much computation time. Finally,
models may use a significant number of redundant or irrelevant
features, which can lead to suboptimal ranking
performance. Thus, how to reduce the number of features
used by algorithms has emerged as a crucial issue.
Nevertheless, only a few attempts have been made to solve this
problem. In the following section, we propose an overview
of existing feature selection methods in classification and
learning to rank.


B. Feature selection methods in learning to rank 

In classification, there are three kinds of feature selection
methods, called filter, wrapper and embedded. In filter
methods, a subset of features is selected as a preprocessing
step, independently of the predictor used for learning. In
wrapper methods, the machine learning algorithm is used as
a black box to score subsets of features according to their
predictive power. The subset with the highest score is then
chosen. Finally, in embedded methods, feature selection is
performed within the training phase and incorporated into the
algorithm. Embedded methods are generally specific to a
given machine learning algorithm. A broad introduction to
feature selection for classification is presented in the work
of Guyon and Elisseeff 1^ . Feature selection methods for
learning to rank have been developed in a similar way as in
classification. We give an overview of feature selection
methods for learning to rank below and we
classify them into filter, wrapper and embedded categories in
Table I.


To the best of our knowledge, the first proposal of a feature
selection method dedicated to learning to rank is the work
of Geng et al. [2]. Their method is called Greedy search
Algorithm for feature Selection (GAS) and belongs to filter
approaches. For each feature, they first define its importance
score: they rank instances according to feature values and
evaluate the performance of the ranking list with a measure such
as Mean Average Precision (MAP) or Normalized Discounted
Cumulative Gain (NDCG). This evaluation measure is then
used as the importance score for the feature. For each pair
of features, they also define a similarity score, which is the
value of Kendall's $\tau$ between the rankings induced by the
features of the pair. Kendall's $\tau$ is defined as follows: if x
and y are two features and P is the number of document pairs,
then

$$\tau(x, y) = \frac{\#\{(d_s, d_t) \mid d_s \prec_x d_t \text{ and } d_s \prec_y d_t\}}{P}$$

where $d_s \prec_x d_t$ indicates that the document $d_t$ is ranked
above the document $d_s$ according to the value of feature x. An
optimization problem is then formulated to select features by
simultaneously maximizing the total importance score and
minimizing the total similarity score. This optimization problem
is solved by a greedy search algorithm. They show that the GAS
algorithm can significantly improve the performance in terms of
MAP or NDCG while reducing the number of features.
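
As a rough illustration of the similarity score described above, the following sketch counts the fraction of document pairs that two features order in the same way. The function name `feature_agreement` and the toy feature values are hypothetical, and counting tied pairs as disagreements is an assumption made here, not something taken from the GAS paper.

```python
from itertools import combinations


def feature_agreement(x, y):
    """Fraction of document pairs ordered identically by features x and y.

    x and y hold one feature value per document, in the same document
    order. A pair is concordant when both features strictly prefer the
    same document; tied pairs count as discordant (an assumption).
    """
    pairs = list(combinations(range(len(x)), 2))
    if not pairs:
        raise ValueError("need at least two documents")
    concordant = sum(
        1 for s, t in pairs
        if (x[s] - x[t]) * (y[s] - y[t]) > 0
    )
    return concordant / len(pairs)


# Toy example: four documents scored by two hypothetical features.
feature_a = [2.1, 0.4, 1.7, 0.9]
feature_b = [0.8, 0.1, 0.3, 0.5]
print(feature_agreement(feature_a, feature_b))
```

Two features with a score close to 1 induce nearly the same ranking, so a selection procedure of this kind would tend to keep only one of them.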

Hua et al. [3] later proposed a two-phase feature selection
strategy. In a first step, they define the similarity between
features in the same way as in the GAS algorithm. Features
are then clustered into groups according to their similarity,
by using a k-means approach. The number of clusters to be
used is chosen according to a quality measure defined by
the authors. In a second step, they propose to select a single
representative feature from each cluster to learn the model.
They use two delegation strategies for this purpose: a filter
one based on evaluation measure (BEM) and a wrapper one
implied by the learning to rank method used (ILTR). The
BEM delegation method selects the feature whose ranking
has the best evaluation score. The ILTR delegation method
learns a linear model using a learning to rank algorithm.
For each cluster, the representative feature is then the one
with the highest weight in the ranking function. They show
that BEM and ILTR techniques can significantly improve the
performance in terms of NDCG@10 compared to models
with no feature selection.

Some other works have focused on the development of
wrapper approaches for feature selection in learning to rank.

Pan et al. [5] proposed a method using boosted regression
trees. In a similar way to [2], they define an importance
score for each feature and a similarity score for each pair
of features. The importance score is the relative importance
score as defined by Friedman [26] for boosted regression trees.
The similarity score is defined by Kendall's $\tau$ between the
vectors of values for the features of the pairs. The authors
investigate three optimization problems: (1) to maximize the
importance score, (2) to minimize the similarity score and
(3) to simultaneously maximize the importance score and
minimize the similarity score. These optimization problems
are solved by a greedy approach. Experiments show that
better results are obtained when using only the importance
score than when using both the importance and similarity scores.
Moreover, they point out that a 30-feature model achieves
performance in terms of NDCG@5 similar to that of the complete
model with 419 features. In a second approach, they propose
a randomized feature selection with a feature-importance-based
backward elimination. In practice, they create subsets
of features, then iteratively train boosted trees and remove
a percentage of features according to their NDCG@5
performance. The experimental results show that these methods
achieve performance comparable to that of the complete model by
using only 30 features.

Yu et al. [4] proposed two effective feature selection
methods for ranking based on Relief algorithms [27]. Relief
algorithms are iterative methods that update the feature weights
at each iteration, based on their importance. The authors
propose RankWrapper, a wrapper approach for training data
with relative orderings, and RankFilter, a filter approach for
training data with multi-level relevance classes. They also
define new updating rules for the weights for each algorithm.
Experiments on synthetic and benchmark datasets show that
their method outperforms the GAS algorithm and can be used
with large-scale datasets.

Dang and Croft [6] proposed a feature selection technique
based on the wrapper approach defined in 1^ . They use a
best-first search procedure to create subsets of features. For
each subset, they train a model with a ranking algorithm.
The output is defined as a new feature. A new feature vector
is then created with the output of each subset and contains
fewer features than the initial dataset. Models are trained using
this vector with four well-known learning to rank algorithms:
RankNet, RankBoost, AdaRank and Coordinate Ascent. Their
experiments on Letor datasets show that they produce
comparable performance in terms of NDCG@5 by using the smaller
feature vector.

Finally, Pahikkala et al. [7] proposed an algorithm called
greedy RankRLS, which is a wrapper approach based on the
existing RankRLS algorithm. Subsets of features are created
on which a leave-query-out cross-validation is performed
by using the RankRLS algorithm. Results on the Letor 4.0
distribution show that the performance in terms of MAP and
NDCG@10 is comparable to that of state-of-the-art algorithms
using all the features.

Recently, embedded methods have been proposed to deal
with the problem of feature selection. These approaches
introduce a sparse regularization term in the formulation of
the optimization problem. Although sparse regularizations are
widely used in classification to deal with feature selection,
only a few attempts have been made to propose
sparse-regularized learning to rank methods. Sun et al. [8]
implemented a sparse algorithm called RSRank to directly optimize
the NDCG. They propose a framework to reduce ranking
to importance-weighted pairwise classification. To achieve
sparsity, they introduce an $\ell_1$-regularization term and solve
the optimization problem using truncated gradient descent.
Experiments on the Ohsumed and TD2003 datasets show that
only about a third of the features remained after the selection.
Moreover, the performance of the learned model is comparable
to or significantly better than the baselines, depending on the
dataset and the measure used.

A more recent work by Lai et al. [9] proposed a primal-dual
algorithm for learning to rank called FenchelRank. The authors
formulate the sparse learning to rank problem as an SVM
problem with an $\ell_1$-regularization term. They use the properties
of Fenchel duality to solve the optimization problem.
Basically, FenchelRank is an iterative algorithm that works in
three steps. At each iteration, it first checks whether the
stopping criterion is satisfied. If not, the algorithm then greedily
chooses a feature to update according to its value. Finally, it
updates the weights of the ranking model. Experiments were
conducted on several datasets from the Letor 3.0 and Letor
4.0 collections. The authors show that FenchelRank leads to
good sparsity, with sparsity ratios from 0.1875 to 0.5. It also
provides comparable or significantly better results in terms of
TABLE I: Classification of feature selection algorithms for
learning to rank into filter, wrapper and embedded categories.

Filter approaches      Wrapper approaches      Embedded approaches
GAS [2]                Hierarchical-ILTR [3]   RSRank [8]
Hierarchical-BEM       BRTree [5]              FenchelRank
RankFilter [4]         RankWrapper [4]         FSMRank
                       BFS-Wrapper [6]         [This work]
                       GreedyRankRLS [7]


MAP and NDCG than state-of-the-art algorithms and RSRank.
Finally, Lai et al. [10] recently proposed a new embedded
algorithm for feature selection based on sparse SVM, called
FSMRank. This algorithm solves a joint convex optimization
problem in order to learn ranking functions while automatically
selecting the best features. They use a Nesterov approach to
ensure fast convergence. They show that FSMRank can efficiently
learn ranking models that outperform the GAS algorithm.

In classification, a wide range of embedded methods has
been developed to learn sparse models with SVM. As far
as we know, FenchelRank and FSMRank are the only ones
to use sparse SVM for feature selection in learning to rank.
Sparse SVM could be widely and efficiently adapted for this
purpose. In this paper, we focus on SVM methods with a
sparse regularization term, for which we propose a short overview
in the following section.


C. Learning as regularized empirical loss minimization

Structural risk minimization is a useful and widely
used induction principle in machine learning. It states that
the learning task can be defined as the following optimization
problem (for a given positive constant C):

$$\min_{\mathbf{w}, b} \; C \sum_{i=1}^{n} L(\mathbf{x}_i^\top \mathbf{w} + b, y_i) + \Omega(\mathbf{w}). \tag{1}$$

The loss function $L(\cdot, \cdot)$ is the data fitting term measuring
the discrepancy, for all training examples $\{\mathbf{x}_i, y_i\}$,
between the predicted value $f(\mathbf{x}) = \mathbf{x}^\top \mathbf{w} + b$
and the observed value y. The second term $\Omega(\cdot)$ is a penalty
providing regularization and controlling generalization ability
through model complexity.

Note that the training examples may also be expressed
as a matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top \in \mathbb{R}^{n \times d}$
and a vector $\mathbf{y} = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$ containing
the output objective values.

Many choices are possible for the loss function $L(\cdot, \cdot)$, such
as the hinge loss, the squared hinge loss or the logistic loss
for classification problems, or the $\ell_2$-loss or the Huber loss
for a regression problem. Similarly, the regularization term
$\Omega(\cdot)$ can be the usual $\ell_2$-regularization, also known as
ridge regularization ($\Omega_2(\mathbf{w}) = \|\mathbf{w}\|_2^2$), or the
$\ell_1$-regularization term, also known as lasso, with
$\Omega_1(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_{j=1}^{d} |w_j|$. The
latter has been proposed in the context of linear regression
by Tibshirani im in order to promote automated feature
selection.

D. Feature selection via penalization

When dealing with feature selection, the structural risk
minimization principle still holds. In this case, the relevant
penalty from the statistical point of view is the $\ell_0$
regularization term. It is defined as
$\Omega_0(\mathbf{w}) = \|\mathbf{w}\|_0 = \sum_{j=1}^{d} \mathbb{1}_{w_j \neq 0}$,
i.e. the number of non-zero components in vector $\mathbf{w}$. But
the minimization of such a functional suffers from severe
drawbacks for optimization: it is neither convex, nor continuous,
nor differentiable (at zero).

A way to tackle these issues is to relax the $\ell_0$ constraint by
replacing it by another penalty. A popular choice is the
$\ell_1$-regularization term. It has several advantages, such as being
convex and thus providing tractable optimization problems.

However, while the lasso penalty alleviates the optimization
issues raised by the $\ell_0$ penalty, it brings some statistical
concerns since it has been shown to be, in certain cases,
inconsistent for variable selection and biased 1^ . A simple
way to obtain nice statistical properties while preserving the
computational efficiency of the $\ell_1$ minimization is to use
a weighted lasso penalty by assigning different weights to
each coefficient. This leads to the weighted $\ell_1$ regularization
$\Omega_\beta(\mathbf{w}) = \sum_{j=1}^{d} \beta_j |w_j|$, where the
$\beta_j > 0$ are the data-dependent weights of each variable. The
next step consists in proposing a suitable choice for these
weights.

For regression problems, [29] proposed to use
$\beta_j = 1/|w_j^{LS}|$, where $\mathbf{w}^{LS}$ is the non-regularized
least squares estimator. This approach cannot be applied in our
framework, since the computation of the unpenalized solution is
intractable. Instead we propose to derive these weights from a
non-convex relaxation of the $\ell_0$ pseudo-norm of the form

$$\Omega(\mathbf{w}) = \sum_{j=1}^{d} g(|w_j|) \tag{2}$$

where $g(\cdot)$ are non-convex functions. Indeed, when it is well
chosen, the non-convex nature of function g will provide
good statistical properties such as unbiasedness and oracle
inequalities m, while the use of a weighted $\ell_1$
implementation scheme will ensure nice computational behavior. This
can be obtained because the associated problem has been
shown to be part of a more general optimization framework
that can be solved using a simple reweighted $\ell_1$ scheme with
$\beta_j = g'(|w_j^*|)$, as presented in Section IV-C, where $g'$
denotes the derivative of g and $\mathbf{w}^*$ is a previously computed
solution 111 .

Among all possible choices for g, we can cite the $\ell_p$
pseudo-norm with p < 1 proposed by m and used more recently
in compressed sensing applications [32]. Another well-known
approximation of the $\ell_0$ penalty is the log penalty, which has
been introduced in [33] in the context of variable selection.
The Minimax Concave Penalty (MCP) has been proposed in
m in order to minimize the bias introduced by classical
$\ell_1$ regularizations. The smoothly clipped absolute deviation
(SCAD) is another popular choice. These regularization terms
are plotted in Figure 1 while their definition is recalled in the
associated table.
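
To make the reweighting rule $\beta_j = g'(|w_j^*|)$ more concrete, the sketch below evaluates it for three penalties. The closed forms used here ($g(u) = \log(\epsilon + u)$ for the log penalty, $g(u) = u^p$ for the $\ell_p$ pseudo-norm, and the common MCP parameterization with threshold `lam` and concavity `gamma`) are standard textbook choices and are assumptions: the definitions used in the paper are given in the table accompanying Figure 1, which is not reproduced here. All function names are hypothetical.

```python
def log_weight(u, eps=1.0):
    """Derivative of g(u) = log(eps + u) for u >= 0 (assumed form)."""
    return 1.0 / (eps + u)


def lp_weight(u, p=0.5, eps=1e-8):
    """Derivative of g(u) = u**p; eps avoids the singularity at u = 0."""
    return p * (u + eps) ** (p - 1.0)


def mcp_weight(u, lam=1.0, gamma=2.0):
    """Derivative of the MCP (assumed form): max(0, lam - u / gamma)."""
    return max(0.0, lam - u / gamma)


def reweight(w_prev, weight_fn):
    """Weights beta_j = g'(|w_j|) from a previously computed solution."""
    return [weight_fn(abs(wj)) for wj in w_prev]


w_prev = [0.0, 0.5, -3.0]
print(reweight(w_prev, log_weight))  # larger coefficients get smaller weights
print(reweight(w_prev, mcp_weight))  # weight is 0 once |w_j| >= gamma * lam
```

In each case the weight decreases as the coefficient grows, so features that were already large in the previous solution are penalized less in the next weighted $\ell_1$ problem, which is what reduces the bias of the plain lasso.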

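In Section III, we formulate the sparse regularized problem.
In Section IV, we propose two algorithms to solve this
non-differentiable problem.

Fig. 1: Comparison of several non-convex regularization terms.
$\epsilon = 1$ and $\gamma = 1$ are the parameters of the log and
MCP regularizations, respectively.


III. Problem statement


A. Preference learning with SVM

We consider a learning to rank problem in IR for which the
ranking of documents according to queries is to be optimized. Let
Q be the total number of queries and D the total number of
documents in the training dataset T. Then,
$\mathcal{Q} = \{q_k\}_{k=1,\ldots,Q}$ is the set of all queries and
$\mathcal{D} = \{d_i\}_{i=1,\ldots,D}$ is the set of all documents. Each
(query, document) pair is represented by a vector of features
$\phi(q_k, d_i) \in \mathbb{R}^d$, where d is the number of features.
Let $\mathbf{w} \in \mathbb{R}^d$ be the vector of weights of the learned
model. We also define
$\Gamma^k = \{(i, j)_{i,j=1,\ldots,D} \mid d_i \succ_{q_k} d_j\}$ as
the subset of all the index pairs for which there is
a preference between $d_i$ and $d_j$ for the query $q_k$.

The optimization problem to be solved in pairwise SVM for
ranking is defined as in [22]:

$$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i,j,k} \xi_{i,j,k} \tag{3}$$

under the constraints

$$\forall (d_i, d_j) \in \Gamma^1, \quad \mathbf{w}^\top (\phi(q_1, d_i) - \phi(q_1, d_j)) \geq 1 - \xi_{i,j,1}$$
$$\vdots$$
$$\forall (d_i, d_j) \in \Gamma^Q, \quad \mathbf{w}^\top (\phi(q_Q, d_i) - \phi(q_Q, d_j)) \geq 1 - \xi_{i,j,Q}$$

and

$$\forall i, \forall j, \forall k, \quad \xi_{i,j,k} \geq 0.$$

We can reduce this problem to a classification problem. Let
$I^k = \{i\}_{i=1,\ldots,D \text{ and } ((i,\cdot) \text{ or } (\cdot,i)) \in \Gamma^k}$
be the subset of indices of documents that take part in a
preference relation for query $q_k$. We can then define the
following vectors:

$$\mathbf{Q} = [\underbrace{1 \ldots 1}_{\mathrm{card}\{I^1\}} \; \ldots \; \underbrace{k \ldots k}_{\mathrm{card}\{I^k\}} \; \ldots \; \underbrace{Q \ldots Q}_{\mathrm{card}\{I^Q\}}], \qquad \mathbf{D} = [I^1 \; \ldots \; I^k \; \ldots \; I^Q]$$

where $\mathbf{Q}, \mathbf{D} \in \mathbb{N}^n$ with
$n = \sum_{k=1}^{Q} \mathrm{card}\{I^k\}$. Then the subset
$\mathcal{P} \subset \mathbb{N}^2$ of all preference relations in the
training dataset T is defined as:

$$\mathcal{P} = \{(s, t)_{s=1,\ldots,n;\, t=1,\ldots,n} \mid (\mathbf{D}(s), \mathbf{D}(t)) \in \Gamma^{\mathbf{Q}(s)}, \; \mathbf{Q}(s) = \mathbf{Q}(t)\}$$

Each feature vector can be written as
$\phi(q_{\mathbf{Q}(s)}, d_{\mathbf{D}(s)}) = \mathbf{x}_s$, $s = 1, \ldots, n$,
and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the matrix of all
$\mathbf{x}_s$ vectors. The pairwise optimization problem is defined as:

$$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{p=1}^{P} \xi_p \tag{4}$$

under the constraints

$$\tilde{\mathbf{x}}_p^\top \mathbf{w} \geq 1 - \xi_p, \qquad \xi_p \geq 0, \qquad \forall p = 1, \ldots, P$$

where $\tilde{\mathbf{x}}_p = \mathbf{x}_s - \mathbf{x}_t$ corresponds to a
unique pair $(s, t) \in \mathcal{P}$ and
$\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_P]^\top \in \mathbb{R}^{P \times d}$
is thus the matrix of all preference pairs. Problem (4) is then
equivalent to problem (3) and is written as a classification
problem. By using the squared hinge loss, such that
$\xi_p = \max(0, 1 - \tilde{\mathbf{x}}_p^\top \mathbf{w})^2$, the pairwise
optimization problem finally is:

$$\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{p=1}^{P} \max(0, 1 - \tilde{\mathbf{x}}_p^\top \mathbf{w})^2 \tag{5}$$

The use of the squared hinge loss in this context has been
proposed by Chapelle and Keerthi ED for differentiability
reasons when solving the pairwise problem in the primal.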
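
The reduction to a classification problem can be sketched as follows: within each query, every pair of documents with a preference yields one difference vector $\tilde{\mathbf{x}}_p = \mathbf{x}_s - \mathbf{x}_t$, and the objective of problem (5) is then evaluated on these rows. The helper names and the toy data are hypothetical, and deriving preferences from graded relevance labels is an assumption about how the preference sets are obtained.

```python
def preference_pairs(X, queries, relevance):
    """Return difference vectors x_s - x_t for all within-query preferences.

    X: list of feature vectors, queries: query id per row,
    relevance: graded label per row (a higher label means preferred).
    """
    pairs = []
    for s in range(len(X)):
        for t in range(len(X)):
            if queries[s] == queries[t] and relevance[s] > relevance[t]:
                pairs.append([a - b for a, b in zip(X[s], X[t])])
    return pairs


def pairwise_objective(w, X_tilde, C=1.0):
    """Objective (5): 0.5 * ||w||_2^2 + C * sum of squared hinge losses."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for row in X_tilde:
        margin = sum(a * b for a, b in zip(row, w))
        loss += max(0.0, 1.0 - margin) ** 2
    return reg + C * loss


# Toy data: two queries with two documents each.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 0.5]]
queries = [1, 1, 2, 2]
relevance = [2, 0, 1, 0]
X_tilde = preference_pairs(X, queries, relevance)
print(X_tilde)  # one row per preference pair
print(pairwise_objective([1.0, 0.0], X_tilde))
```

Because each row already encodes "preferred minus non-preferred", all labels are implicitly +1, which is why no $y_p$ appears in problems (4) and (5).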

B. Sparse regularized SVM for preference ranking

To achieve feature selection in the context of SVM, a
common solution is to introduce a sparse regularization term.
We propose in the following to consider the Lasso formulation
for feature selection, which combines an $\ell_1$ sparsity term and a
squared hinge loss.

In classification, the Lasso SVM solves the following
optimization problem:

$$\min_{\mathbf{w}} \; \|\mathbf{w}\|_1 + C \sum_{i=1}^{n} \max(0, 1 - y_i \mathbf{x}_i^\top \mathbf{w})^2 \tag{6}$$

According to equations (5) and (6), we directly formulate
the pairwise Lasso SVM by replacing the $\ell_2$-term by an
$\ell_1$-term as:

$$\min_{\mathbf{w}} \; \|\mathbf{w}\|_1 + C \sum_{p=1}^{P} \max(0, 1 - \tilde{\mathbf{x}}_p^\top \mathbf{w})^2 \tag{7}$$

The optimization problem for pairwise learning to rank with
Lasso SVM is thus reduced to a Lasso classification problem
on the matrix of preferences. One critical issue may arise
when using this formulation. Indeed, as the $\ell_1$-norm is not
differentiable, this problem might be quite difficult to solve.
However, a wide range of methods and algorithms has been
proposed in classification in order to solve it. Thus, we argue
that considering pairwise sparse SVMs is well suited
to selecting features in learning to rank, for two reasons.

Firstly, contrary to several proposed approaches such as
GAS, sparse regularized SVM methods do not require the extra
development of similarity and importance measures dedicated
to learning to rank. Indeed, the feature selection is only based
on the properties of the regularization term, so no additional
assumptions are needed. Secondly, as they follow the SVM
framework for classification, methods and algorithms used in
classification can easily be adapted and applied to learning
to rank with little implementation effort. In this paper, we
propose to use an adaptation of the Fast Iterative Shrinkage
Thresholding Algorithm in order to solve the sparse regularized
optimization problem and to perform feature selection.
We present this algorithm in the following section. We also
propose to use non-convex regularizations instead of the
$\ell_1$-penalty in order to counter the statistical issues that may
arise. We propose a second algorithm in the following section to
deal with the non-convex regularizations.


IV. Learning preferences with sparse SVM

In this section, we discuss the proposed methods for
learning preferences with sparse SVM. Firstly, we introduce the
Forward-Backward Splitting algorithm, which is a well-known
approach for solving the non-differentiable weighted-$\ell_1$
regularized problem. Secondly, this algorithm is adapted to the
problem of preference learning and its convergence is proved.
Finally, we propose a general approach for solving the learning
problem with non-convex regularization terms.

A. Forward-Backward Splitting Algorithms for feature selection

Forward-Backward Splitting (FBS) algorithms have been
proposed initially to solve non-differentiable optimization
problems such as $\ell_1$-norm regularized learning problems. A
good introduction to this kind of algorithm is given in 041 .
Consider a problem of the form

$$\min_{\mathbf{w}} \; J_1(\mathbf{w}) + \lambda \Omega(\mathbf{w}) \tag{8}$$

where $J_1(\cdot)$ is a differentiable objective function with a
Lipschitz continuous gradient and $\Omega(\cdot)$ is a convex
regularization term having a closed-form proximity operator. The
proximity operator of the regularization is defined as

$$\mathrm{Prox}_{\mu\Omega}(\mathbf{z}) = \arg\min_{\mathbf{w}} \; \frac{1}{2} \|\mathbf{z} - \mathbf{w}\|_2^2 + \mu \Omega(\mathbf{w}) \tag{9}$$

FBS algorithms are iterative methods that compute at each
iteration the proximity operator of the regularization term
on a gradient descent step with respect to the differentiable
function, thus leading to the following update

$$\mathbf{w}^{k+1} = \mathrm{Prox}_{\frac{\lambda}{L}\Omega}\left(\mathbf{w}^k - \frac{1}{L} \nabla J_1(\mathbf{w}^k)\right) \tag{10}$$

where $1/L$ is a gradient step and L has to be a Lipschitz constant
of $\nabla J_1$ in order to ensure convergence. Note that one can
easily compute the proximity operator of the $\ell_1$-regularization,
which is of the following form:

$$\mathrm{Prox}_{\lambda\|\cdot\|_1}(\mathbf{w})_j = \max\left(0, 1 - \frac{\lambda}{|w_j|}\right) w_j = \mathrm{sign}(w_j)(|w_j| - \lambda)_+ \quad \forall j \in 1, \ldots, d \tag{11}$$

The weighted $\ell_1$ regularization has a similar proximity
operator

$$\mathrm{Prox}_{\lambda\Omega_\beta}(\mathbf{w})_j = \max\left(0, 1 - \frac{\lambda\beta_j}{|w_j|}\right) w_j = \mathrm{sign}(w_j)(|w_j| - \lambda\beta_j)_+ \quad \forall j \in 1, \ldots, d \tag{12}$$

This algorithm, also known as the Iterative Shrinkage
Thresholding Algorithm (ISTA), has been proposed to solve linear
inverse problems with $\ell_1$-regularization as presented in iTSll . In
their paper, Beck and Teboulle also address one limitation of
this kind of approach: speed of convergence. Although these
algorithms are able to deal with large-scale data, they may
converge slowly. They proposed to use a multistep version
of the algorithm called the Fast Iterative Shrinkage Thresholding
Algorithm (FISTA), which converges more quickly to the
optimal objective value. This algorithm is given in Algorithm 1.

Algorithm 1 Accelerated FBS algorithm
1: Initialize $\mathbf{w}^0$
2: Initialize L as a Lipschitz constant of $\nabla J_1(\cdot)$
3: $k = 1$, $t^1 = 1$, $\mathbf{z}^1 = \mathbf{w}^0$
4: repeat
5:   $\mathbf{w}^k \leftarrow \mathrm{Prox}_{\frac{\lambda}{L}\Omega}(\mathbf{z}^k - \frac{1}{L} \nabla J_1(\mathbf{z}^k))$
6:   $t^{k+1} \leftarrow \frac{1 + \sqrt{1 + 4 (t^k)^2}}{2}$
7:   $\mathbf{z}^{k+1} \leftarrow \mathbf{w}^k + \frac{t^k - 1}{t^{k+1}} (\mathbf{w}^k - \mathbf{w}^{k-1})$
8:   $k \leftarrow k + 1$
9: until Convergence

B. FBS for sparse preference learning

In this section, we discuss how we adapted the FISTA
algorithm to the problem of preference learning with $\ell_1$ and
weighted $\ell_1$-regularized SVM.

First we note that problem (7) is a sum of a differentiable
function, the data fitting loss, and a non-differentiable
$\ell_1$-regularization. We then solve the equivalent problem as
in (8) with $\Omega(\mathbf{w}) = \|\mathbf{w}\|_1$, $\lambda = 1/C$ and
$J_1(\mathbf{w}) = \sum_{p=1}^{P} \max(0, 1 - \tilde{\mathbf{x}}_p^\top \mathbf{w})^2$.

In order to ensure convergence of the algorithm, the cost
function $J_1(\mathbf{w})$ must have a Lipschitz continuous gradient.
It therefore suffices to prove Proposition 1.

Proposition 1. Let $J_1(\mathbf{w})$ be the squared hinge loss

$$J_1(\mathbf{w}) = \sum_{p=1}^{P} \max(0, 1 - \tilde{\mathbf{x}}_p^\top \mathbf{w})^2$$
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Then its gradient 

p 

V Ji (w) = —2 ^ Xp max(0,1 — w) 

p=i 

is Lipschitz and continuous. 

Proof: The squared hinge loss is gradient Lipschitz if
there exists a constant $L$ such that:

$$\|\nabla J_1(w_1) - \nabla J_1(w_2)\|_2 \le L \|w_1 - w_2\|_2 \quad \forall w_1, w_2 \in \mathbb{R}^d .$$

The proof essentially relies on showing that
$x_p \max(0, 1 - x_p^\top w)$ is itself Lipschitz, i.e., there exists
$L' \in \mathbb{R}$ such that

$$\|x_p \max(0, 1 - x_p^\top w_1) - x_p \max(0, 1 - x_p^\top w_2)\|_2 \le L' \|w_1 - w_2\|_2 .$$

Now let us consider the different situations. For given $w_1$ and
$w_2$, if $1 - x_p^\top w_1 \le 0$ and $1 - x_p^\top w_2 \le 0$, then the
left-hand side is equal to 0 and any $L'$ satisfies the inequality. If
$1 - x_p^\top w_1 \le 0$ and $1 - x_p^\top w_2 \ge 0$, then the left-hand
side (lhs) is

$$\mathrm{lhs} = \|x_p\|_2 (1 - x_p^\top w_2) \le \|x_p\|_2 (x_p^\top w_1 - x_p^\top w_2) \le \|x_p\|_2^2 \, \|w_1 - w_2\|_2 .$$

A similar reasoning yields the same bound when
$1 - x_p^\top w_1 \ge 0$ and $1 - x_p^\top w_2 \le 0$, and when
$1 - x_p^\top w_1 \ge 0$ and $1 - x_p^\top w_2 \ge 0$. Thus,
$x_p \max(0, 1 - x_p^\top w)$ is Lipschitz with constant
$\|x_p\|_2^2$. We conclude the proof by noting that
$\nabla_w J_1$ is Lipschitz as a sum of Lipschitz functions, and
the related constant is $2 \sum_{p=1}^{P} \|x_p\|_2^2$.
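The bound can be checked numerically. The snippet below is a minimal NumPy sketch, not part of the original experiments: the matrix `X` of preference vectors $x_p$ is random, hypothetical data, and the function names are illustrative. It evaluates the gradient of the squared hinge loss and compares the observed ratio $\|\nabla J_1(w_1) - \nabla J_1(w_2)\|_2 / \|w_1 - w_2\|_2$ with the constant $2\sum_p \|x_p\|_2^2$.

```python
import numpy as np

def squared_hinge_grad(w, X):
    """Gradient of J1(w) = sum_p max(0, 1 - x_p^T w)^2; the rows of X are the x_p."""
    margins = np.maximum(0.0, 1.0 - X @ w)
    return -2.0 * X.T @ margins

def lipschitz_bound(X):
    """Constant 2 * sum_p ||x_p||_2^2 obtained by summing the per-term constants."""
    return 2.0 * float(np.sum(X ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # hypothetical preference vectors (assumption)
L_bound = lipschitz_bound(X)

# Largest observed ratio ||grad(w1) - grad(w2)|| / ||w1 - w2|| over random pairs.
worst_ratio = 0.0
for _ in range(200):
    w1, w2 = rng.normal(size=8), rng.normal(size=8)
    gap = np.linalg.norm(squared_hinge_grad(w1, X) - squared_hinge_grad(w2, X))
    worst_ratio = max(worst_ratio, gap / np.linalg.norm(w1 - w2))
```

The observed ratio stays below the bound, which is loose: a tighter constant is twice the squared largest singular value of `X`.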

We have thus proved that $J_1(w)$ has a Lipschitz continuous
gradient, which ensures the convergence of the algorithm. The
gradient of $J_1(w)$ is easy to compute and can be used as is in
the FISTA algorithm. We can thus use the FISTA algorithm to
solve the $\ell_1$ and weighted $\ell_1$-regularized SVM problems. We
call this algorithm RankSVM-$\ell_1$.
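As an illustration of the accelerated proximal scheme, here is a minimal NumPy sketch under stated assumptions. The rows of `X` are taken to be the preference vectors $x_p$, the step size is $1/L$ with $L$ twice the squared largest singular value of `X`, and the momentum rule is the standard FISTA one. The function names and the toy data are hypothetical; this is not the authors' released implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of the (weighted) l1 norm: sign(v_j) * max(|v_j| - t_j, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(w, X, lam, beta):
    """Squared hinge loss plus weighted l1 penalty."""
    loss = np.sum(np.maximum(0.0, 1.0 - X @ w) ** 2)
    return loss + lam * np.sum(beta * np.abs(w))

def fista_rank_svm(X, lam, beta=None, n_iter=500):
    """FISTA for min_w sum_p max(0, 1 - x_p^T w)^2 + lam * sum_j beta_j |w_j|."""
    d = X.shape[1]
    beta = np.ones(d) if beta is None else np.asarray(beta, dtype=float)
    # Lipschitz constant of the gradient (assumption: spectral-norm bound).
    L = max(2.0 * np.linalg.norm(X, 2) ** 2, 1e-12)
    w = np.zeros(d)
    y, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = -2.0 * X.T @ np.maximum(0.0, 1.0 - X @ y)
        w_next = soft_threshold(y - grad / L, lam * beta / L)  # proximal step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)       # momentum step
        w, t = w_next, t_next
    return w

# Toy data (hypothetical): only the first two features carry the preference.
rng = np.random.default_rng(0)
w_true = np.zeros(10)
w_true[:2] = [2.0, -1.0]
X = rng.normal(size=(200, 10))
X *= np.sign(X @ w_true)[:, None]   # orient rows so that x_p^T w_true > 0
w_hat = fista_rank_svm(X, lam=20.0)
```

Passing a non-uniform `beta` gives the weighted $\ell_1$ variant with the same code path.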

C. Algorithm for non-convex regularization 

When using a non-convex regularization term as presented
in equation 0, the previous algorithm cannot be used. We
propose in this case to adapt an existing general purpose
framework. The main idea behind this
framework is to cast the regularization term as a difference
of two convex functions. Convergence to a stationary point
has been proven on this particular class of problems when
performing a primal/dual optimization 1 ^ .

The approach introduced in OOl can also be seen as
a Majorization Minimization method lEl. Indeed, one can
clearly see in Figure 1 that all of the proposed regularization
terms are concave on the positive orthant. This implies that,
for a fixed point $u_0 > 0$,

$$\forall u > 0, \quad g(u) \le g(u_0) + g'(u_0)(u - u_0) .$$

The algorithm consists in iteratively minimizing the majorization
of the cost function. When removing the constant term, the
optimization problem for iteration $k + 1$ is

$$w^{k+1} = \arg\min_{w \in \mathbb{R}^d} \; J_1(w) + \lambda \sum_{j=1}^{d} \beta_j |w_j| \qquad (13)$$

where $\beta_j = g'(|w_j^{k}|)$ is computed using the solution at
the previous iteration. This approach is extremely interesting
in our case, as we can readily use the efficient algorithm
proposed for the weighted $\ell_1$ regularization. Moreover, one
can use a warm-starting scheme that initializes the solver
with the solution of the previous iteration. The resulting
algorithm is given in Algorithm 2 and the derivative functions
$g'(\cdot)$ for the non-convex regularizations are given in Table II.

Algorithm 2 Solver for non-convex regularization
1: Initialize $w^0$ and $k = 1$
2: Initialize $\beta_j = 1, \forall j$
3: repeat
4:   $w^k \leftarrow$ Minimize majorization (13) using Algorithm 1
5:   $\beta_j \leftarrow g'(|w_j^k|), \forall j$
6:   $k \leftarrow k + 1$
7: until Convergence

TABLE II: Derivatives of the nonconvex regularization terms

Reg. term          $g'(|w_j|)$
$\ell_p$, $p < 1$  $p\,|w_j|^{p-1}$
log                $1/(\epsilon + |w_j|)$
MCP                $\max(1 - |w_j|/(\gamma\lambda), 0)$
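The reweighted scheme can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: `weighted_l1_fista` is a hypothetical stand-in for the inner weighted $\ell_1$ solver, the rows of `X` are assumed to be preference vectors, the stopping rule is an assumption, and the small `eps` in `lp_weights` is added only to keep the weight finite at zero.

```python
import numpy as np

def weighted_l1_fista(X, lam, beta, w0, n_iter=300):
    """Inner solver (stand-in): FISTA on squared hinge + lam * sum_j beta_j |w_j|."""
    L = max(2.0 * np.linalg.norm(X, 2) ** 2, 1e-12)
    w, y, t = w0.copy(), w0.copy(), 1.0
    for _ in range(n_iter):
        grad = -2.0 * X.T @ np.maximum(0.0, 1.0 - X @ y)
        v = y - grad / L
        w_next = np.sign(v) * np.maximum(np.abs(v) - lam * beta / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w

def log_weights(w, eps=0.1):
    """g'(|w_j|) for the log penalty."""
    return 1.0 / (eps + np.abs(w))

def mcp_weights(w, lam, gamma=2.0):
    """g'(|w_j|) for the MCP penalty."""
    return np.maximum(1.0 - np.abs(w) / (gamma * lam), 0.0)

def lp_weights(w, p=0.5, eps=1e-3):
    """g'(|w_j|) for the l_p penalty; eps is an assumption to avoid division by zero."""
    return p * (np.abs(w) + eps) ** (p - 1.0)

def reweighted_l1(X, lam, weight_fn, n_outer=10, tol=1e-6):
    """Outer loop: solve the weighted l1 problem, update the weights, repeat."""
    d = X.shape[1]
    w, beta = np.zeros(d), np.ones(d)
    for _ in range(n_outer):
        w_new = weighted_l1_fista(X, lam, beta, w0=w)   # warm start from previous w
        beta = weight_fn(w_new)
        converged = np.linalg.norm(w_new - w) <= tol * max(1.0, np.linalg.norm(w))
        w = w_new
        if converged:
            break
    return w

# Toy usage on hypothetical data.
rng = np.random.default_rng(0)
w_true = np.zeros(10)
w_true[:2] = [2.0, -1.0]
X = rng.normal(size=(200, 10))
X *= np.sign(X @ w_true)[:, None]
w_log = reweighted_l1(X, lam=20.0, weight_fn=log_weights)
w_mcp = reweighted_l1(X, lam=20.0, weight_fn=lambda w: mcp_weights(w, lam=20.0))
```

The first outer iteration uses unit weights, so it coincides with the plain $\ell_1$ problem; later iterations penalize small coefficients more strongly.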

V. Experimental framework 

A set of numerical experiments has been conducted on
benchmark datasets to evaluate the performance of the framework
we proposed. In this section, we provide a full description
of the datasets and the measures used. We also present
the experimental protocol.

A. Datasets 

We conduct our experiments on the Letor 3.0 and Letor 4.0
collections. These are benchmarks dedicated to learning to
rank. Letor 3.0 contains seven datasets: Ohsumed, TD2003,
TD2004, HP2003, HP2004, NP2003 and NP2004. Letor 4.0
contains two datasets, MQ2007 and MQ2008. Their characteristics
are summarized in Table III. Each dataset is divided
into five folds, in order to perform cross validation. For each
fold, train, validation and test sets are provided.

B. Evaluation measures 

We evaluate the ranking performance of our approach using 
the Mean Average Precision (MAP) and the Normalized 
Discounted Cumulative Gain (NDCG). MAP is a standard 
evaluation measure in information retrieval that works with 
binary relevance judgements: relevant or not relevant. It is 
based on the computation of the precision at the position k 
which represents the fraction of relevant documents at the 
position k in the ranking list for a query q: 

$$P_q@k = \frac{\#\{\text{relevant documents within the } k \text{ top documents}\}}{k}$$

TABLE III: Characteristics of Letor 3.0 and Letor 4.0 distributions.

                                Number of                            Average number (per query) of
Dataset   features   queries   query-document pairs   preferences   documents   relevant documents
TD2003       64         50            49058              398562        981.1            8.1
NP2003       64        150           148657              150495        991              1
HP2003       64        150           147606              181302        984              1
TD2004       64         75            74146             1079810        988.6           14.9
NP2004       64         75            73834               75747        992.1            1.1
HP2004       64         75            74409               80306        984.5            1
Ohsumed      45        106            16140              582588        152.2           45.6
MQ2007       46       1692            69623              404467         41.1           10.6
MQ2008       46        784            15211               80925         19.4            3.7

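The Average Precision (AP) at the position $k$ is then defined
for the query $q$ as:

$$AP_q@k = \frac{\sum_{i=1}^{k} P_q@i \cdot \mathbb{1}_{\{\text{document } i \text{ is relevant}\}}}{\#\{\text{relevant documents for the query } q\}}$$

The MAP is defined as the average of AP over all the
queries:

$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q$$

where $AP_q$ is the AP of query $q$ taken over its whole ranked
list and $Q$ is the number of queries.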
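The definitions of precision, AP and MAP can be illustrated with a short pure-Python sketch. It assumes that each query is given as a list of binary relevance labels in ranked order and that the list contains all the relevant documents of the query; the function names are illustrative.

```python
def precision_at_k(relevance, k):
    """Fraction of relevant documents among the k top-ranked documents."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """AP over the whole ranked list; relevance is a list of 0/1 labels."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0  # convention chosen here for queries without relevant documents
    total = sum(precision_at_k(relevance, i + 1)
                for i, rel in enumerate(relevance) if rel)
    return total / n_relevant

def mean_average_precision(rankings):
    """Average of AP over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Two hypothetical queries: AP = (1 + 2/3) / 2 for the first, 1/2 for the second.
rankings = [[1, 0, 1, 0], [0, 1]]
map_value = mean_average_precision(rankings)
```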


Unlike MAP, the NDCG can deal with more than two levels
of relevance. Let $r(i)$ be the relevance level of the document at
position $i$. Given a query $q$, the Discounted Cumulative Gain
at position $k$ is defined as:

$$DCG_q@k = \sum_{i=1}^{k} \frac{2^{r(i)} - 1}{\log_2(i + 1)}$$

DCG can take values greater than 1. A normalization term is
then introduced to obtain values between 0 and 1:

$$NDCG_q@k = \frac{1}{Z_k} DCG_q@k$$

where $Z_k$ is the maximum value of DCG@k.
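A minimal sketch of DCG and NDCG follows. It assumes that the list of graded relevance levels for a query contains all judged documents, so that $Z_k$ can be computed from the ideal (descending) ordering of the same list; the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with gain 2^r - 1 and discount log2(i + 1), positions starting at 1."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG@k divided by Z_k, the DCG@k of the ideal ordering."""
    z_k = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / z_k if z_k > 0 else 0.0

# Hypothetical ranked list with relevance levels in {0, 1, 2}.
ranked = [2, 0, 1]
dcg = dcg_at_k(ranked, 3)     # 3/log2(2) + 0 + 1/log2(4) = 3.5
ndcg = ndcg_at_k(ranked, 3)
```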

We also evaluate the ability of our approach to promote
sparsity. To this purpose, we compute the sparsity ratio,
which is the fraction of remaining features in the model after
selection. For each fold $f \in 1, \dots, N_{\mathcal{T}}$ of the dataset
$\mathcal{T}$, $N_{\mathcal{T}}$ being the number of folds, we define the
sparsity ratio $SR_f$ as:

$$SR_f = \frac{\#\{\text{remaining features in the learned model}\}}{\#\{\text{features of the given dataset}\}}$$

We do not consider features that are zero for all the queries.
Thus, the total number of features of a given dataset can be
smaller than indicated in Table III. The sparsity ratio of the
algorithm $A$ for a given dataset $\mathcal{T}$ is the average of
$SR_f$ over all the folds:

$$SR = \frac{1}{N_{\mathcal{T}}} \sum_{f=1}^{N_{\mathcal{T}}} SR_f$$
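The sparsity ratio can be computed as in the sketch below. It assumes that a feature counts as remaining when its learned weight is non-zero (up to a tolerance `tol`, an assumption) and that a boolean mask `usable` flags the features that are not zero for all queries; the names and the toy weights are hypothetical.

```python
def sparsity_ratio(weights, usable, tol=0.0):
    """Fraction of usable features whose learned weight is non-zero."""
    kept = sum(1 for w, u in zip(weights, usable) if u and abs(w) > tol)
    total = sum(1 for u in usable if u)
    return kept / total

def mean_sparsity_ratio(fold_weights, usable, tol=0.0):
    """Average of the per-fold sparsity ratios."""
    ratios = [sparsity_ratio(w, usable, tol) for w in fold_weights]
    return sum(ratios) / len(ratios)

# Hypothetical example: 4 features, the last one is zero for all queries.
usable = [True, True, True, False]
fold_weights = [[0.5, 0.0, -1.2, 0.0],   # 2 of 3 usable features kept
                [0.0, 0.0, 0.3, 0.0]]    # 1 of 3 usable features kept
mean_sr = mean_sparsity_ratio(fold_weights, usable)
```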


C. Experimental protocol 

For each dataset, we first train the algorithms on the training
set with different values of $C$ on a grid. For each fold, the $C$
value that leads to the best MAP performance on the validation
set is chosen. The model trained with this $C$ value is used
for prediction on the test set. We compute the MAP and the
NDCG@10 on the test dataset. We then compare the convex
algorithm and the non-convex algorithm to the state-of-the-art
methods FenchelRank and FSMRank. We do not compare
our method to the GAS algorithm, since it has been shown
to be outperformed by the FSMRank algorithm ifTOll . We run
the Windows/MS-DOS FenchelRank executable provided on the
author's personal web page^ and the Matlab code of FSMRank
provided by the authors on demand. We use the same grid as
the authors to tune the parameters. Please note that, since we
use the MAP instead of the NDCG@10 to choose the optimal
value of $r$ on the validation set, we obtain different models and
results than in 13 and Eol. Finally, we set $\gamma = 2$ for the
MCP penalty and $\epsilon = 0.1$ for the log penalty, which are values
commonly used in the community.
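The selection step of this protocol can be sketched generically. In the snippet below, `train` and `validation_score` are hypothetical callables standing for "train a model with a given $C$ on the training set" and "compute the MAP of a model on the validation set"; the grid and the toy scoring function are assumptions used only to make the example runnable.

```python
import math

def select_by_validation(grid, train, validation_score):
    """Return the (C, model, score) triple with the best validation score."""
    best_c, best_model, best_score = None, None, float("-inf")
    for c in grid:
        model = train(c)
        score = validation_score(model)
        if score > best_score:          # ties keep the first value on the grid
            best_c, best_model, best_score = c, model, score
    return best_c, best_model, best_score

# Toy stand-ins: the "model" only remembers C, and the score peaks at C = 10.
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
train = lambda c: {"C": c}
validation_score = lambda model: -(math.log10(model["C"]) - 1.0) ** 2
best_c, best_model, best_score = select_by_validation(grid, train, validation_score)
```

The selected model is then evaluated once on the test set of the same fold.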

For each experiment, we use the paired one-sided Student
test to evaluate the significance of our results. A result
is significantly better than another if the p-value provided by
the Student test is lower than 5%. Results in
terms of sparsity ratio are illustrated by a spider (or radar) plot.
Spider plots allow us to easily compare the behavior of several
algorithms on several datasets according to a given measure.
Each branch of the plot represents a dataset, while each line
stands for an algorithm.
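A paired one-sided Student test can be sketched with the standard library only. The example below is an assumption-laden illustration: it pairs scores over five folds (the text does not state which units are paired), the scores are made up, and significance is decided by comparing the statistic with 2.132, the one-sided 5% critical value of the Student distribution with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for the paired differences a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)   # sample standard deviation
    return mean(diffs) / standard_error, n - 1

# Hypothetical per-fold MAP values for the best method and for another method.
best_method = [0.47, 0.45, 0.48, 0.46, 0.49]
other_method = [0.44, 0.43, 0.47, 0.44, 0.46]

t_stat, dof = paired_t_statistic(best_method, other_method)
T_CRITICAL_5PCT_ONE_SIDED_4DOF = 2.132
significantly_better = t_stat > T_CRITICAL_5PCT_ONE_SIDED_4DOF
```

With a statistics package available, the same decision can be made from the p-value instead of a tabulated critical value.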

VI. Results and discussion 

In this section, we compare our convex and non-convex
frameworks to state-of-the-art methods. Firstly, we analyze the
performance of the non-convex framework in terms of sparsity
ratio. Secondly, we show that using non-convex regularizations
leads to similar results both in terms of MAP and NDCG@10.
Finally, we confront the sparsity ratio with the performance
in terms of IR measures to demonstrate that non-convex
regularizations are truly competitive compared to
state-of-the-art approaches.

A. Sparsity ratio 

As we stated in the introduction, feature selection is a key
issue in learning to rank. We aim at providing effective methods
that can learn high quality models while automatically
selecting a small number of highly informative features. The
main goal of using non-convex regularizations is to sharply
reduce the number of features used in ranking models. In
this section, we analyze the sparsity ratios we obtain by using
non-convex penalties and we compare them with those of the two
algorithms FenchelRank and FSMRank and of our $\ell_1$ algorithm.

^scholat.com/'hanjiang. Last visited on 12/09/2012

Table IV presents the sparsity ratio obtained with FenchelRank,
FSMRank, the $\ell_1$ regularization and the three non-convex
penalties log, MCP and $\ell_{0.5}$. We restrict our analysis
to this value of $p$ for readability reasons. Figure 2 presents the
spider plot of 1 - sparsity ratio, that is, the ratio of removed
features. The larger this measure is, the better the algorithm
is at inducing sparsity into models.

We firstly observe on both Table IV and Figure 2 that, on
average, methods that use a convex penalty are not as sparse as
those using non-convex regularizations. Methods that use the
$\ell_1$ penalty are the least sparse. In particular, FSMRank leads
to a higher sparsity ratio on most of the datasets, which means
that the learned models contain many more features than those
learned by the other methods. The MCP penalty appears to be
the least sparse of the non-convex penalties. This is not really
surprising, since the MCP penalty was initially proposed
not as a feature selection approach, but as a way to minimize
the bias induced by the $\ell_1$ regularization.

When considering the average sparsity ratio, the use of the log
and $\ell_p$ penalties makes sense. These two non-convex penalties
lead to the smallest sparsity ratios. On average over all datasets,
the learned models select half of the features used by convex
regularizations. When considering each dataset independently, the
log and $\ell_p$ penalties select up to twelve times fewer features
than the convex ones. These penalties are thus highly effective
methods to achieve feature selection. The $\ell_{0.5}$ penalty is
particularly effective for inducing sparsity on HP2003, TD2003
and the MQ datasets. For the latter datasets, it selects from
around six to twelve times fewer features than the
state-of-the-art algorithms. The log penalty is the most effective
for inducing sparsity on the Ohsumed, HP2004 and NP2004 datasets.
It frequently selects from a quarter to half as many features as
the convex regularizations. More precisely:

• on Ohsumed, the log penalty selects from half to a third as
many features as convex and MCP penalties,

• on MQ2008, $\ell_{0.5}$ uses four to six times fewer features
than convex or MCP regularizations,
while the log penalty selects two to three times fewer
features,

• on MQ2007, the log penalty selects half as many features
as convex and MCP penalties, while $\ell_{0.5}$ selects from ten
to twelve times fewer features than convex regularizations
and MCP,

• on HP2004, the two non-convex penalties use from a
quarter to a half as many features as MCP and convex
regularizations,

• on NP2004, the log penalty selects from a half to a third
as many features as MCP and convex regularizations,

• on TD2004 and HP2003, $\ell_{0.5}$ uses two to three times
fewer features than MCP and convex
regularizations,

• on TD2003, the non-convex $\ell_{0.5}$ penalty uses half as many
features as MCP and convex regularizations.

Non-convex penalties are thus shown to be very competitive
methods when considering the number of selected features.
The difference in sparsity ratio observed between datasets is
due to the intrinsic differences between datasets. The Ohsumed
dataset and the Letor 4.0 and Letor 3.0 collections do not all use
the same features. Although features are similar for the HP, NP
and TD datasets, those datasets are not related to similar retrieval
tasks. The number of relevant features may vary from one
kind of dataset to another, and so may the relevant features
themselves. Nevertheless, we may reasonably expect to select
the same features on datasets related to the same tasks. The
different performance in terms of sparsity ratio of a given
algorithm among the datasets should not be seen as a drawback
of the method, but as a specificity of the dataset.

TABLE IV: Comparison of sparsity ratio between convex and
non-convex regularizations and state-of-the-art algorithms.

Dataset   Fenchel   FSM    $\ell_1$   MCP    log    $\ell_{0.5}$
Ohsumed   0.23      0.41   0.28       0.34   0.12   0.22
MQ2008    0.3       0.42   0.39       0.27   0.09   0.07
MQ2007    0.58      0.64   0.6        0.19   0.29   0.05
HP2004    0.19      0.26   0.17       0.24   0.06   0.12
NP2004    0.27      0.37   0.43       0.32   0.13   0.23
TD2004    0.46      0.67   0.42       0.28   0.36   0.3
HP2003    0.27      0.48   0.25       0.39   0.27   0.16
NP2003    0.23      0.44   0.48       0.36   0.23   0.27
TD2003    0.53      0.76   0.48       0.40   0.34   0.20
Mean      0.34      0.49   0.39       0.31   0.21   0.18

Fig. 2: Ratio of removed features for each algorithm and
regularization on Letor 3.0 and 4.0 corpora.
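The reduction factors quoted in this section can be recomputed from the sparsity ratios of Table IV. The sketch below copies two rows of the table (MQ2007 and Mean); the key `"l1"` assumes that the unlabeled third column of the table is the $\ell_1$ regularization, and the helper names are illustrative.

```python
# Sparsity ratios copied from Table IV (two rows only).
sparsity = {
    "MQ2007": {"Fenchel": 0.58, "FSM": 0.64, "l1": 0.60,
               "MCP": 0.19, "log": 0.29, "l0.5": 0.05},
    "Mean":   {"Fenchel": 0.34, "FSM": 0.49, "l1": 0.39,
               "MCP": 0.31, "log": 0.21, "l0.5": 0.18},
}

def reduction_factor(dataset, baseline, method):
    """How many times fewer features `method` keeps compared with `baseline`."""
    return sparsity[dataset][baseline] / sparsity[dataset][method]

def removed_ratio(dataset, method):
    """Quantity plotted in the spider plot: 1 - sparsity ratio."""
    return 1.0 - sparsity[dataset][method]

factor_mq2007 = reduction_factor("MQ2007", "l1", "l0.5")   # about 12
factor_mean = reduction_factor("Mean", "l1", "l0.5")       # about 2.17
```

On MQ2007 the factor is about twelve, and on average it is about two, in line with the "twelve times fewer" and "half as many" statements.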

Removing a large number of features is not appropriate if
it leads to a degradation of prediction quality. In the following
section, we compare the performance in terms of IR measures
between non-convex and convex regularizations.

B. Performance in terms of IR measures 

In this section, we compare the predictions of our proposed
frameworks to those of the two state-of-the-art algorithms
FenchelRank and FSMRank. Table V (respectively Table VI)
indicates the algorithm that leads to the best value of MAP
(respectively NDCG@10). Each algorithm is also compared
with the best algorithm by using the one-sided Student
test. If a significant decrease is observed, the percentage
of degradation and the p-value are indicated. If no significant
variation is observed, the two algorithms are considered as
equivalent and the ~ symbol is used. When considering MAP
and NDCG values, one can notice that some algorithms
perform better on some datasets than on others. This is not
specific to our methods and had already been observed for
learning-to-rank algorithms IT].

1) General results: All the algorithms tend to provide
similar results in terms of both MAP and NDCG@10 on the
different datasets. Nevertheless, some differences in terms of
performance can be observed among the algorithms, especially
on the HP2004 and NP2004 datasets, on which we notice the
largest variations of MAP and NDCG@10. A more detailed
study is conducted in order to determine whether these
differences are significant.

2) MAP analysis: We notice that on four datasets, namely
Ohsumed, MQ2008, NP2004 and TD2003, all the algorithms
provide results similar to those of the best algorithm. On the other
five datasets, we observe that some convex and non-convex
algorithms can lead to some degradation of the evaluation
measure. We observe narrow decreases (less than 1%) in half
of the cases. Limited (between 3% and 4%) and higher (up to
11%) decreases also occurred. A deeper analysis follows.

FSMRank provides the highest value of MAP on the
MQ2007 dataset. Our proximal approach with $\ell_1$ regularization
is the only one that leads to equivalent results. We
observe a very narrow decrease of the MAP when using
FenchelRank (-0.8%) and our reweighted framework with
MCP and log penalties (-0.5% and -0.7% respectively). The
use of the $\ell_p$ regularization leads to a 3% degradation of the
MAP, which is still reasonable.

FenchelRank provides the best MAP results on HP2004, but
all the other algorithms and regularizations provide comparable
results, except the log penalty. The framework using the
log regularization leads to a degradation of 11% of the MAP,
so the use of this penalty might not be a good choice in terms
of MAP performance on this particular dataset. When using
non-convex penalties, the $\ell_p$ or MCP ones should be preferred
on this dataset.

On the TD2004 dataset, we observe a significant degradation of
the MAP only when considering the MCP regularization. The
other non-convex penalties and the $\ell_1$ regularization lead to
results equivalent to the best algorithm. On the NP2003 dataset,
all the algorithms lead to similar results, except when using
the $\ell_p$ penalty, for which a small degradation is observed.

Finally, on the HP2003 dataset, we notice that all the non-convex
penalties provide results as good as our $\ell_1$ algorithm, for which
the MAP is the highest. FSMRank is the only one for which a
degradation is observed.

All in all, the framework we proposed leads to competitive
results in terms of MAP. The $\ell_1$ algorithm is the best on one
dataset and provides equivalent results on all other datasets.
The MCP regularization leads to the best MAP values on two
datasets and is equivalent to the best method on five datasets.
The log and $\ell_p$ penalties provide results similar to the best
algorithm on seven datasets.

3) NDCG@10 analysis: When considering the NDCG@10,
we observe that most of the algorithms provide similar results
on most of the datasets. The highest NDCG@10 values
are obtained by FenchelRank on two datasets, FSMRank on
two datasets, our $\ell_1$ framework on three datasets and our
framework with the non-convex log penalty on two datasets.
Nevertheless, all the algorithms provide results similar to those
of the best algorithm on seven datasets: Ohsumed,
MQ2008, NP2004, TD2004, HP2003, NP2003 and TD2003.
Thus, our framework leads to performance similar to that of the
state-of-the-art algorithms and can provide the highest NDCG@10
values, although the increase is not significant.

As noticed for the MAP, FSMRank provides the best values
on the MQ2007 dataset. We observe significant decreases
for all the other algorithms, although the degradations are
narrow (less than 1% in most cases). On the HP2004 dataset,
our framework performs as well as the best algorithm, except
for the log penalty. In this last case, we observe a 4% decrease
of the NDCG@10.

Experiments thus show that our convex and non-convex
frameworks provide results similar to those of convex
state-of-the-art algorithms in most cases. They can lead to higher MAP
or NDCG values, although the increase is not statistically
significant.

C. Discussions 

In the previous sections, we analyzed independently the ability
of our framework to select only a small number of features
and its prediction performance. We showed that non-convex
penalties are competitive for reducing the number of features used
by the learned models. We also pointed out that non-convex
penalties lead to results similar to those of the best algorithms on
most datasets.

Figure 3 plots the MAP values against the sparsity ratio
for three representative datasets. For each dataset, the average
value of MAP among all the algorithms is represented by a
dotted line. We restrict the number of datasets for readability
reasons. Figure 3 shows that non-convex regularizations,
especially the $\ell_p$ and log penalties, are highly competitive
feature selection methods, both in terms of sparsity and
prediction quality. Indeed, they achieve MAP and NDCG@10
performances that are similar to those of state-of-the-art convex
algorithms, while selecting half as many features on average over
the datasets. On most datasets, the log and $\ell_p$ penalties are
the methods that select the smallest number of features, while
the MAP remains stable. They can select up to six times
fewer features than the convex algorithms, without any
significant degradation of evaluation measures.

In the few cases where a significant decrease is observed,
the degradation is usually narrow. On the MQ2007 dataset, the
MAP and NDCG@10 degradation observed when using the
log penalty is less than 1%. It is similar to those obtained
with convex algorithms, whereas the log penalty selects half
as many features as the convex algorithms. On the same
dataset, we observe a 3% decrease of the MAP and a 5.6%
decrease of the NDCG@10 when using the $\ell_p$ penalty, but
the algorithm selects up to twelve times fewer features than the
convex approaches, and up to four times fewer features than
the other non-convex algorithms.

TABLE V: Comparison of MAP between the best method on
each dataset and the other algorithms. The best MAP on each
dataset is reported as a value. The ~ symbol indicates equivalence
between two methods. The percentage of decrease is presented
when statistically significant under the 5% threshold (p-values
in parentheses).

Dataset   Fenchel        FSM           $\ell_1$   MCP             Log              $\ell_{0.5}$
Ohsumed   ~              ~             ~          0.4513          ~                ~
MQ2008    0.4804         ~             ~          ~               ~                ~
MQ2007    -0.8% (0.02)   0.4672        ~          -0.5% (0.03)    -0.7% (<0.001)   -3% (0.02)
HP2004    0.7447         ~             ~          ~               -11% (0.01)      ~
NP2004    0.6946         ~             ~          ~               ~                ~
TD2004    ~              0.2327        ~          -4.7% (0.039)   ~                ~
HP2003    ~              -4% (0.008)   0.7638     ~               ~                ~
NP2003    ~              0.6823        ~          ~               ~                -4% (0.042)
TD2003    ~              ~             ~          0.2670          ~                ~

On HP2004, we observe a degradation of 11% of the MAP
and of 4% of the NDCG when using the log penalty, but this
method presents the best sparsity ratio and selects a quarter
as many features as state-of-the-art methods. On the other hand,
the $\ell_p$ penalty provides results as good as the best method
while selecting 37% fewer features than the best algorithm. On
NP2003, we observe a degradation only for the $\ell_p$ penalty,
whereas the log penalty provides results similar to those of convex
methods and uses half as many features as the best algorithm,
FSMRank, and the same number of features as FenchelRank.
On this very particular dataset, non-convex methods do not
perform as well as on the others, which may be due to the
specificity of this dataset.

Moreover, we did not tune the specific parameters of the
non-convex regularizations, but set them to default values that are
commonly used by the community. Results may be improved by
an appropriate tuning of these parameters.

In conclusion, the framework we proposed is able to
provide results similar to those of state-of-the-art approaches in
terms of prediction quality, while selecting half as
many features. The proposed methods are thus competitive for
feature selection in learning to rank.

VII. Conclusion and Perspectives 

In this work, we presented a general framework for feature
selection in learning to rank, using SVM with sparse
regularizations. We first proposed an accelerated proximal
algorithm to solve the convex $\ell_1$-regularized problem. This
algorithm has the same theoretical convergence rate as
the state-of-the-art FenchelRank and FSMRank algorithms. We
showed that a reweighted $\ell_1$ scheme can be used in order to
solve non-convex problems. This scheme is implemented in
a second algorithm that solves problems with MCP, log and
$\ell_p$, $p < 1$, penalties. To the best of our knowledge, this is the
first work that proposes to consider non-convex penalties for feature
selection in learning to rank. We conducted experiments on
two major benchmarks in learning to rank that include nine
different datasets, on which we evaluated the performance in
terms of MAP and NDCG@10. We also evaluated the ability
of our framework to induce sparsity into models.

TABLE VI: Comparison of NDCG@10 between the best
method on each dataset and the other algorithms. The best NDCG@10
on each dataset is reported as a value. The ~ symbol indicates
equivalence between two methods. The percentage of decrease is
presented when statistically significant under the 5% threshold
(p-values in parentheses).

Dataset   Fenchel       FSM      $\ell_1$        MCP             Log              $\ell_{0.5}$
Ohsumed   ~             ~        ~               ~               0.4591           ~
MQ2008    ~             0.2323   ~               ~               ~                ~
MQ2007    -1% (0.043)   0.4445   -0.7% (0.042)   -0.9% (0.013)   -0.5% (0.0045)   -5.6% (0.0016)
HP2004    0.8274        ~        ~               ~               -4% (0.044)      ~
NP2004    ~             ~        0.8148          ~               ~                ~
TD2004    ~             ~        ~               ~               0.3153           ~
HP2003    0.8283        ~        ~               ~               ~                ~
NP2003    ~             ~        0.7839          ~               ~                ~
TD2003    ~             ~        0.3538          ~               ~                ~

We pointed out that the non-convex penalties lead to similar
prediction quality, whatever the evaluation measure, while
using only half as many features as convex methods. Our
framework is thus a novel, competitive and effective embedded
method for feature selection in learning to rank. Its originality
is to consider non-convex regularizations in order to induce
more sparsity into models without degradation of the
prediction quality. Moreover, we will provide publicly available
software for the two proposed algorithms in order to promote
reproducible research.

This work and the contributions of Sun et al. H, Lai et al.
0 m show the effectiveness of embedded methods in the
field of feature selection for learning to rank. More specifically,
the use of sparse regularized SVM seems to be a promising
way to handle the issue of feature selection and dimensionality
reduction in learning to rank. To the best of our knowledge,
our work is the first that proposes a feature selection framework
for learning to rank that is not restricted to the use of
$\ell_1$-regularization. A wide range of issues still needs to be explored.
In future work, we plan to evaluate the impact of tuning the
parameters of the non-convex regularizations on both sparsity and
prediction quality. A large study of the computational times
of the sparse learning to rank algorithms could be conducted.
Finally, as feature selection can be used to learn
ranking functions specific to subsets of queries, one of the most
promising directions of work is the field of multitask learning.
We plan to investigate the potential of a sparse regularized
SVM algorithm using a Fast Iterative Shrinkage Thresholding
framework, to be compared with existing multitask algorithms
ESlESlllol.

Fig. 3: MAP versus sparsity ratio for three representative datasets. Dotted lines represent the average MAP obtained with the different
algorithms.